Abstract

Most biochemical reactions in a cell are regulated by highly specialized proteins, which are the prime mediators of 
the cellular phenotype. Therefore the identification, quantitation and characterization of all proteins in a cell are of 
utmost importance to understand the molecular processes that mediate cellular physiology. With the advent of 
robust and reliable mass spectrometers that are able to analyze complex protein mixtures within a reasonable 
timeframe, the systematic analysis of all proteins in a cell becomes feasible. Besides the ongoing improvements of 
analytical hardware, standardized methods to analyze and study all proteins have to be developed that allow the 
generation of testable new hypotheses based on the enormous pre-existing amount of biological information. Here
we discuss current strategies on how to gather, filter and analyze proteomic data sets using available software
packages.



Background 

Proteins are involved in almost all physiological aspects 
of cellular life, from the catalysis of biochemical reactions
within the intermediary metabolism to the processing
and integration of internal and external signals.
The misregulation of protein expression results in 
pathological states such as cancer, neurodegenerative 
diseases and metabolic imbalances. Proteins are synthesized
by translating the information encoded in an RNA
molecule to a polypeptide chain, which adopts a specific 
three dimensional structure. Proteins are subjected to a 
constant turnover making protein homeostasis a very 
important feature of their regulation. Many proteins 
function within large multimeric complexes that are 
highly dosage dependent. The recent developments in 
gathering large-scale genomic, transcriptomic and proteomic
data pose substantial challenges to the bioinformatic
processing of these data, which have not yet been
completely solved. In the case of proteomic investigations,
the challenges occur at all levels, ranging from sample
preparation and data gathering through raw data
integration and database searching to the functional
interpretation of large datasets. Based on an iterative
strategy of proteomic analysis, data interpretation and
systematic challenges, hypotheses can be developed and
modified, which will eventually lead to the generation of
new knowledge (Figure 1).

Mass spectrometry data analysis 

All proteins from a sample of interest are usually 
extracted and digested with one or several proteases 
(typically trypsin alone or in combination with Lys-C 
[1]) to generate a defined set of peptides. Several enrich- 
ment and fractionation steps can be introduced at pro- 
tein or peptide level in this general workflow when 
sample complexity has to be reduced or when a specific 
subset of proteins/peptides should be analysed (e.g.
organelle-specific proteomes [2,3] or substoichiometric
post-translationally modified peptides [4]).
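
The digestion step can be mirrored in silico to predict which peptides a protein should yield. The following is a minimal sketch, assuming the commonly used rule that trypsin cleaves C-terminal to lysine (K) or arginine (R) unless the next residue is proline (P); the function name, the example sequence and the handling of missed cleavages are illustrative and not taken from any specific software package.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Return peptides from an in silico tryptic digest.

    Assumed rule: cleave after K or R, but not when followed by P.
    Peptides containing up to `missed_cleavages` uncut sites are included.
    """
    # Cleavage positions (index of the first residue AFTER each cut).
    sites = [
        i + 1
        for i, residue in enumerate(protein[:-1])
        if residue in "KR" and protein[i + 1] != "P"
    ]
    bounds = [0] + sites + [len(protein)]

    peptides = []
    for start in range(len(bounds) - 1):
        # Extend the peptide across 0..missed_cleavages uncut sites.
        last = min(start + 1 + missed_cleavages, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptide = protein[bounds[start]:bounds[end]]
            if peptide:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence used only for illustration.
    print(tryptic_digest("MKAARPGKLLR"))
    print(tryptic_digest("MKAARPGKLLR", missed_cleavages=1))
```

Allowing one or two missed cleavages is a common choice when building a search space, because digestion is rarely complete in practice.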

The peptides obtained are subsequently analysed by 
liquid chromatography coupled to mass spectrometry 
(LC-MS). The two most common approaches here are 
either designed to achieve a deep coverage of the pro- 
teome (shotgun MS [5]) or to collect as much quantitative 
information as possible for a defined set of proteins/pep- 
tides (targeted MS [6]). During the analysis peptides elut- 
ing from the chromatography are selected according to 
defined rules (see below) and further fragmented within 
the mass spectrometer. The resulting tandem mass spectra
(MS2) provide information about the sequence of the peptide,
which is key to its identification. For a shotgun
approach, no prior knowledge of the peptides present in
the sample is required to define peptide selection criteria
during the LC-MS analysis. Therefore, the peptides eluting
from the chromatographic column are identified in a
data-dependent mode [7], in which the N most abundant
peptides at a given retention time are continuously selected
for fragmentation and their masses are excluded from further
selection for a defined time. By using this dynamic
exclusion [8], less abundant peptides are also selected for
fragmentation.

Figure 1 Integrated Proteomic Workflow: Samples of interest
are subjected to protein extraction and digestion. The resulting
peptides are separated by C18 chromatography and directly
electrosprayed into the mass spectrometer, where their
mass-to-charge ratios and fragmentation spectra are recorded. MS data are
analysed to identify and quantify the detected peptides and to
assemble them into proteins. Once the proteomics analysis per se is
finished, the functional analysis of the relevant differential proteins
may unmask pathways, interactions and PTMs relevant for the biological
question of interest. This in silico information can be used to
formulate new hypotheses that could eventually be used to
interrogate the biological system again.
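
The data-dependent selection of the N most abundant precursors with dynamic exclusion can be sketched as follows. This is a simplified illustration, not the acquisition logic of any particular instrument: the data layout (a list of survey scans, each with a retention time and a list of m/z and intensity pairs), the parameter names and the default values are assumptions made for the example.

```python
def select_precursors(survey_scans, top_n=3, exclusion_time=30.0, mz_tol=0.01):
    """Simulate top-N data-dependent selection with dynamic exclusion.

    survey_scans: list of (retention_time, [(mz, intensity), ...]),
                  ordered by retention time (assumed layout).
    Returns a list of (retention_time, [selected m/z values]).
    """
    excluded = []    # (mz, time at which the exclusion expires)
    selections = []
    for rt, peaks in survey_scans:
        # Drop exclusions that have expired at this retention time.
        excluded = [(mz, until) for mz, until in excluded if until > rt]
        chosen = []
        # Walk through the peaks from most to least intense.
        for mz, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
            if len(chosen) == top_n:
                break
            if any(abs(mz - ex_mz) <= mz_tol for ex_mz, _ in excluded):
                continue  # recently fragmented, skip it
            chosen.append(mz)
            excluded.append((mz, rt + exclusion_time))
        selections.append((rt, chosen))
    return selections


if __name__ == "__main__":
    peaks = [(500.0, 100.0), (600.0, 50.0), (700.0, 10.0)]
    scans = [(0.0, peaks), (1.0, peaks), (2.0, peaks)]
    for rt, mzs in select_precursors(scans, top_n=1):
        print(rt, mzs)
```

With `top_n=1`, the three identical survey scans in the usage example select a different precursor each time, which illustrates how dynamic exclusion gives less abundant peptides a chance to be fragmented.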

The data can be displayed as a 3-D map with the
mass-to-charge ratios (m/z), retention times (RT) and
intensities of the observed peptides as axes, together
with fragmentation spectra (MS2) for those peptides that
were selected during any of the data-dependent cycles.
The intensity of a certain peptide m/z can be plotted
along the RT to obtain the corresponding chromatographic
peak. The area under this curve (AUC) can be
employed to quantify the corresponding peptide, whereas
the peptide is identified through its fragmentation spectrum.
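
As an illustration of AUC-based quantification, the sketch below extracts the intensity trace of one m/z value along the retention time and integrates it with the trapezoidal rule. The scan layout, the m/z tolerance and the function names are assumptions for the example; real software additionally performs peak detection, smoothing and background subtraction, which are omitted here.

```python
def extract_ion_chromatogram(scans, target_mz, tol=0.01):
    """Build the intensity-versus-RT trace for one m/z value.

    scans: list of (retention_time, [(mz, intensity), ...]) ordered by time.
    """
    trace = []
    for rt, peaks in scans:
        # Sum all signal within the m/z tolerance window in this scan.
        intensity = sum(i for mz, i in peaks if abs(mz - target_mz) <= tol)
        trace.append((rt, intensity))
    return trace


def area_under_curve(trace):
    """Integrate an (RT, intensity) trace with the trapezoidal rule."""
    area = 0.0
    for (t0, i0), (t1, i1) in zip(trace, trace[1:]):
        area += 0.5 * (i0 + i1) * (t1 - t0)
    return area


if __name__ == "__main__":
    scans = [
        (0.0, [(500.0, 0.0)]),
        (1.0, [(500.0, 10.0), (650.3, 4.0)]),
        (2.0, [(500.0, 0.0)]),
    ]
    print(area_under_curve(extract_ion_chromatogram(scans, 500.0)))
```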

The large number of MS spectra generated by the
latest generation of mass spectrometers requires automated
search engines capable of identifying and quantifying
the analysed peptides. It is not the aim of this
review to detail the existing algorithms (see [9] for this
purpose), but to give a general idea of how they work and
which kind of data should be expected from them.
Briefly, search algorithms aim to explain a recorded MS2
spectrum by a peptide sequence from a pre-defined
database, returning a list of peptide sequences that fit
the experimental data with a certain probability score or
false discovery rate (FDR). The databases are normally
protein databases translated from genomic data [10],
although other strategies such as spectral libraries [11] or
mRNA databases [12] have been successfully applied. A
final step is then required to assemble the identified
peptides into proteins, which can be challenging, in
particular when dealing with redundant peptides or
alternatively spliced proteins [13]. In all of these cases,
several strategies have been described to reduce the
false discovery rate of such matching approaches at both
the peptide identification and the protein assembly level [14].
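
The principle of explaining an MS2 spectrum by a database peptide can be illustrated with a deliberately simple scoring scheme: theoretical singly charged b- and y-ion masses are computed for each candidate sequence and the number of theoretical fragments found in the observed spectrum is counted. This shared-peak count is a stand-in chosen for clarity; the search engines discussed here use more elaborate probabilistic scores, and precursor mass filtering, modifications and higher charge states are ignored. Function names and the tolerance are assumptions for the example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mz(peptide):
    """Theoretical m/z of singly charged b and y ions of a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions + y_ions


def count_matches(observed_mz, peptide, tol=0.02):
    """Number of theoretical fragments found in the observed peak list."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in fragment_mz(peptide)
    )


def best_match(observed_mz, candidates, tol=0.02):
    """Return (score, peptide) for the best-scoring candidate sequence."""
    return max((count_matches(observed_mz, pep, tol), pep) for pep in candidates)


if __name__ == "__main__":
    # Simulated spectrum of a hypothetical peptide, searched against a toy database.
    spectrum = fragment_mz("PEPTIDEK")
    print(best_match(spectrum, ["AAAAAAAK", "PEPTIDEK", "GASPVTCK"]))
```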

This general shotgun/discovery approach leads to the
identification of thousands of proteins with a dynamic
range of 10^4-10^5 [15] within a complete cellular lysate.
However, this method still has two main drawbacks:
sensitivity and reproducibility. Normally, complete coverage
of proteins and complexes involved in the same signaling
pathway or belonging to the same functional
family is not achieved. Additionally, reproducibility in
protein identification among replicates can vary between
30 and 60% [16,17]. These limitations have been successfully
addressed by so-called targeted proteomics [6].
This approach is based on a general method called
selected reaction monitoring (SRM), in which predefined
peptides are selected and fragmented at scheduled RTs,
and two or three fragments are monitored. Due to the
increased scan speed and mass window selectivity of
current mass analyzers, SRM can be performed simultaneously
on multiple analytes. This capability led to the
multiplexing of SRMs in a method called multiple reaction
monitoring (MRM). The multiplexing capability
has been used to quantify several hundred proteins
over a broad dynamic range, down to proteins present at
very low copy number in the cell (~50 copies/cell), against the
background of the whole range of protein concentrations
in eukaryotic cells [18,19].

The AUC of the monitored fragments can then be used
for quantification. By spiking the peptide mixture with
isotopically labelled standard peptides, such targeted
approaches can also be used to determine absolute rather
than relative quantities of proteins [20] or
post-translational modifications [21]. However, as previous
knowledge about the proteins is required, such targeted
approaches are usually performed in combination with or
subsequent to a shotgun approach. As with genomic
data, shotgun proteomic studies can also be uploaded to
dedicated proteome repositories [22], which can also be
used for database searching. The cooperation of the largest
repositories, PRIDE, Proteome Commons and Peptide Atlas,
within the Proteome Exchange project
(http://www.proteomeexchange.org) allows direct access to most
of the stored proteomic datasets and provides a highly
valuable source for bioinformatic data mining [23-25].
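
The arithmetic behind absolute quantification with a spiked, isotopically labelled standard is straightforward and can be sketched as follows. The example assumes that the endogenous (light) and the labelled (heavy) peptide give the same signal response, that digestion is complete and that one peptide corresponds to one protein copy; the function names, units and example numbers are illustrative.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def absolute_amount_fmol(light_auc, heavy_auc, spiked_fmol):
    """Amount of endogenous peptide from the light/heavy AUC ratio.

    Assumes an identical response of the light and heavy peptide.
    """
    if heavy_auc <= 0:
        raise ValueError("heavy standard signal must be positive")
    return light_auc / heavy_auc * spiked_fmol


def copies_per_cell(amount_fmol, n_cells):
    """Convert a femtomole amount measured from n_cells into copies per cell."""
    molecules = amount_fmol * 1e-15 * AVOGADRO
    return molecules / n_cells


if __name__ == "__main__":
    # Hypothetical values: AUC ratio of 2 against 50 fmol of spiked standard,
    # measured in the lysate of one million cells.
    amount = absolute_amount_fmol(2.0e6, 1.0e6, 50.0)
    print(amount, copies_per_cell(amount, 1_000_000))
```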

GO Term identification and enrichment analysis 

The output of a proteome analysis, either in a shotgun
approach or a more targeted method, is usually a long list
of identified factors that have a probability score and
ideally also a quantitative value associated with them. In
order to understand and interpret these data and to
generate testable hypotheses on the systemic response of
the proteome to a challenge, the list has to be further
classified and filtered. The first step in a functional analysis
of a large protein list is to connect each protein name
to a unique identifier. While gene names have been
standardized, protein names can differ between different
databases and even between releases of the same database.
Although many of the large databases have been curated
in recent years, this can pose quite a
bioinformatic challenge and can lead to a substantial loss
of information. Several web-based algorithms exist to
connect protein names to their corresponding gene
names, such as PICR or CRONOS [26,27]. However,
some functional analyses can directly use protein identifiers
from resources such as the UniProt knowledge base, Ensembl
or the outdated International Protein Index (IPI) [28-30] as input.

A first step for functional interpretation of the resultant
protein list is to connect the protein identifier with its
associated Gene Ontology terms (http://www.geneontology.org,
[31]). Introduction of the Gene Ontology helped
to overcome the redundancy in terminology for biological
processes [32]. Genes are associated with hierarchically
clustered functional terms that describe the "biological
process", "molecular function" or "cellular component",
each of which has a unique identification number. A specific GO
term can be related to more than one parent term, as
long as the whole structure remains an acyclic graph.
The list of terms is not yet complete and changes with
new discoveries, which can make existing GO terms redundant or obsolete.
Another drawback of the use of GO terms for functional
annotation is the fact that most (95%) GO term
annotations are assigned computationally, while only a minority is
manually curated and based on experimental evidence [32].
For single proteins, the simplest way to perform a GO
term annotation is to look up the corresponding terms
with the AmiGO tool provided on the GO website [33]. For
larger data sets and systematic approaches, some database
search algorithms for proteomic data such as MaxQuant,
Proteome Discoverer and X!Tandem [34,35] have implemented
a GO term annotation step. As not all protein
entries are fully annotated with the corresponding GO
terms, it is possible to retrieve GO terms from the closest
related protein via a BLAST similarity search in the
BLAST2GO tool [36].
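
Because a term can have several parents, an annotation to a specific term implies annotation to all of its ancestors. The sketch below shows this propagation on a toy acyclic graph; the term labels and the protein name are hypothetical placeholders rather than real GO identifiers, and only a single generic parent relationship is modelled.

```python
# Toy ontology: child term -> list of parent terms (hypothetical labels).
PARENTS = {
    "process": [],
    "metabolic process": ["process"],
    "cellular process": ["process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "protein acetylation": ["cellular metabolic process"],
}


def ancestors(term, parents):
    """Collect all ancestors of a term by walking up the acyclic graph."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents):
    """Expand protein -> direct terms into protein -> direct + ancestor terms."""
    full = {}
    for protein, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[protein] = expanded
    return full


if __name__ == "__main__":
    print(sorted(propagate({"protein_1": ["protein acetylation"]}, PARENTS)["protein_1"]))
```

Propagating annotations in this way before an enrichment analysis ensures that general terms are counted for every protein annotated to one of their more specific descendants.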

The first step after GO term annotation is a GO term
enrichment analysis, which compares the abundance of specific
GO terms in the dataset with their natural abundance
in the organism or in a reference dataset, e.g. different cell
lines, inhibitor treatments or growth states [37]. To
extract functions that are significantly enriched in one
sample over a second dataset, a p-value is calculated
that indicates the overrepresentation of a specific
GO term; for this, it is necessary to cluster related
GO terms. This calculation can be done by most of the
previously mentioned programs, but there is a plethora
of other, mostly web-based software tools available ([38],
http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools).
For instance, the DAVID and Babelomics
software resources are often mentioned when it is
necessary to analyze large gene lists, but currently there
are more than 60 tools that calculate GO term enrichment
[38-40]. Most of these tools can be classified into three
different types of enrichment algorithms. Singular
enrichment analysis (SEA) is the simplest type and
tests one annotation term at a time for a
list of interesting genes [41]. GOStat, BinGO and
EasyGO are based on SEA algorithms. More sophisticated
are gene set enrichment analysis (GSEA) algorithms,
which take all genes of the analysis into account, not
only genes with a significant change in abundance.
However, GSEA requires a quantitative measurement
to rank the genes; it is used in GSEA/P-GSEA
and Gene Trail. Finally, modular enrichment analysis
(MEA) includes the relationships between annotation terms,
which prevents the loss of important biological correlations
due to lacking relationships and reduces redundancy
[41]. These programs are not limited to GO term
enrichment, but also have modules to search for
protein networks (see below), to convert protein identifiers
and to link to further information and publications
that substantiate the observed gene function. In particular,
the DAVID software resources offer a plethora of other
tools, for instance for gene and annotation term clustering,
mapping of genes to pathways and diseases as well
as advanced statistics. A second important factor for
the result of a GO term enrichment is the reference
dataset, which is either predefined by the tool, for
instance all genes of the organism, or can be selected
manually (all identified proteins) [42]. Weinert et al.
have applied the DAVID GO term enrichment algorithm
to study the conservation of acetylation sites between
human and Drosophila from the extracted GO terms
of acetylated proteins [43]. In their study, they showed
the conservation of protein acetylation in the respiratory
chain, in translational processes and also in ubiquitinating
enzymes.

Schmidt et al. BMC Systems Biology 2014, 8(Suppl 2):S3
http://www.biomedcentral.com/1752-0509/8/S2/S3

Bates et al. could show that the Abl kinase-dependent
reprogramming of B-cells is to a major part
post-transcriptionally regulated, by comparing
mRNA levels with protein abundance upon
imatinib inhibitor treatment [44]. Despite the usefulness of
GO terms for the functional annotation and filtering of large
proteomic data sets, the assignment is highly dependent
on the algorithm used for annotation. Recently, fourteen
GO enrichment algorithms were tested on the same
dataset. The results showed a rather high discrepancy
in the p-values of certain GO terms [42].
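
A singular enrichment analysis can be sketched with a one-sided hypergeometric test, which is one common way to obtain the p-value for the overrepresentation of a single term. Whether a particular tool uses exactly this test is not stated here and should be treated as an assumption; the protein and term names are hypothetical, and multiple-testing correction is omitted. The example also shows how strongly the choice of the reference dataset influences the result.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: proteins in the reference set, K: reference proteins with the term,
    n: proteins in the study list,    k: study proteins with the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enrichment(study, background, term_to_proteins):
    """Test each term separately (SEA) against the given reference set."""
    background = set(background)
    study = set(study) & background
    pvalues = {}
    for term, members in term_to_proteins.items():
        members = set(members) & background
        overlap = len(study & members)
        pvalues[term] = hypergeom_pvalue(
            overlap, len(study), len(members), len(background)
        )
    return pvalues


if __name__ == "__main__":
    proteins = [f"P{i}" for i in range(1, 11)]   # hypothetical identifiers
    terms = {"term_A": proteins[:5]}             # hypothetical annotation
    study = proteins[:5]
    # Same study list, two different reference sets.
    print(enrichment(study, proteins, terms))
    print(enrichment(study, proteins[:6], terms))
```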

Pathway analysis 

A pathway describes the series of chemical reactions in the
cell that lead to an observable biological effect. Proteins
involved in the chemical reactions and those that have a
regulatory influence are combined in so-called pathway
databases. As with the previously described GO term
enrichment analysis, protein or gene lists can also be
scrutinized for pathway abundances, which might be more
meaningful because it moves the data interpretation away
from the gene-centric view towards the identification of
functional biological processes. Furthermore, functionally
independent proteins can share some GO term associations,
for instance for very general terms such as "binding"
or "cytoplasmic". A large number of resources and databases
are available to extract pathway constraints from
biological data (Figure 1). Comprehensive pathway
databases such as KEGG, Reactome, Ingenuity Pathway
Knowledge Base or BioCarta include a large number of
diverse interaction data, which could arise from intracellular
reactions such as metabolism or signaling pathways,
genetic interactions or drug development studies [45-47].
Apart from these comprehensive resources, highly specific
databases have been developed for signal transduction
processes, such as PANTHER, GenMAPP or PID [48-50].
Recently, several databases were created that comprise
pathways active in cancer. Databases such as NetPath
[51] should help to identify cancer-relevant proteins and
genes in a complex dataset. In fact, public databases
share a high degree of connectivity, allowing rapid distribution
of novel findings. A comprehensive list of more
than 300 pathway and interaction data resources can be
found on the Pathguide website http://pathguide.org [52].
Nowadays, enrichment analyses are available with almost
all pathway database resources, so that a list of significantly
altered proteins, with regard to abundance and/or
post-translational modifications, is sufficient to extract
data on pathway abundance. However, similar to GO
term annotation, the identification of pathways affected
under certain conditions is highly dependent on the
algorithm. Müller and colleagues published a comparison
of the Ingenuity Pathway Analysis (IPA) software and
STRING for the analysis of several artificial datasets [53].



The tested datasets consisted of core proteins and
associated proteins of five different pathways (Wnt, App and
Ins signaling, mitochondrial apoptosis and tau
phosphorylation), which were retrieved by
literature mining, together with a set of background proteins from
proteomic analyses of HEK293 cells that were falsely
annotated as significantly regulated proteins in several
repeats. They reported similar results for both pathway
analysis algorithms, but also that neither algorithm could
reach a sufficient p-value for reliable pathway enrichment.
Additional features in IPA, such as the annotation of protein
family and localization, might help the experienced
researcher to identify falsely annotated protein hits.

Analysis of protein-protein interactions

The majority of proteins do not act as independent entities.
They form either transient or stable complexes with
other proteins that act as scaffolds or regulate the protein's
activity. A protein can be involved in multiple complexes
of varying composition, and to completely understand a
biological system it is necessary to analyze the abundant
protein complexes as well as the conditions that lead to
their formation or dissociation. Information on protein
interactions in complexes is deposited in interaction
databases such as MINT, BioGRID, IntAct or HPRD
[54-57], associated with the biological process in which
they are functionally important. Not all annotated interactions
in public databases are based on experimental
observations. Depending on the database used, for example
STRING or iRefWeb, one can find a rather high percentage of
predicted interactions and of interactions based on literature
mining [37,58,59]. For this purpose, a variety
of literature mining tools to screen PubMed abstracts
has been developed, of which Chilibot and SciMiner are
the most favored [60,61]. These interactions are the result of
sophisticated algorithms that are trained on the existing
set of protein-protein interactions. Furthermore, most
large interaction databases have implemented simple
algorithms that allow the mapping of interacting proteins on
the resource website.

Protein interactions are often displayed as large interaction
networks illustrating the high degree of connectivity
and the presence of promiscuous hub proteins. A
widely used resource for interaction data is STRING,
which is not only a database itself, but also connects to several
other data resources and is therefore also capable of
literature mining [59,62]. Further, STRING is also capable
of drawing simple protein networks based on the
provided gene list and the available interactions in its
databases. Cytoscape has evolved as a powerful graphical
tool to draw interaction networks of high complexity and
to incorporate and compare datasets from different
experimental procedures. Cytoscape has only limited
information stored, but interconnects extensively with other
databases to obtain information. Recently, EnrichNet was
launched, a web-based platform integrating pathway and
interaction analysis in six different databases (KEGG, BioCarta,
Gene Ontology, Reactome, Wiki and NCI pathways)
with functional associations and connecting these
data with molecular function (InterPro) and protein complex
information (CORUM) [63]. This tool creates pathway
lists and highly interactive function maps, which can also
be downloaded and visualized in Cytoscape. A study of the
targets of cullin-RING-dependent ubiquitination revealed
that a large fraction of the observed proteins become modified
upon activity of the SCF complex [64]. Analysis of the
obtained list of SCF-regulated proteins with Cytoscape
revealed a high degree of interconnectivity.
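
The notions of an interaction network restricted to a protein list and of highly connected hub proteins can be illustrated without any dedicated software. The sketch below builds an undirected network from a list of pairwise interactions, optionally restricted to the proteins of interest, and ranks proteins by their number of interaction partners. The interaction pairs and the degree threshold are hypothetical, and the code does not reproduce the behaviour of STRING or Cytoscape.

```python
from collections import defaultdict


def build_network(edges, proteins=None):
    """Build an undirected network from (protein_a, protein_b) pairs.

    If `proteins` is given, only interactions between listed proteins are kept.
    """
    keep = set(proteins) if proteins is not None else None
    adjacency = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        if keep is not None and (a not in keep or b not in keep):
            continue
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def hub_proteins(adjacency, min_degree):
    """Proteins with at least `min_degree` partners, most connected first."""
    ranked = sorted(adjacency, key=lambda p: (-len(adjacency[p]), p))
    return [p for p in ranked if len(adjacency[p]) >= min_degree]


if __name__ == "__main__":
    # Hypothetical interaction pairs.
    edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
    network = build_network(edges)
    print(hub_proteins(network, min_degree=3))
    print(build_network(edges, proteins=["A", "B", "C"]))
```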

Protein domain and motif analysis 

When working with organisms that have not yet or only recently
been sequenced, databases might not contain the complete set
of protein descriptions. Similarly, proteins of unknown
function might also be identified from highly curated databases
of well-studied organisms. Those proteins often lack
the previously described information on interactions and
pathway affiliations, so that they would not be found in
such studies. To learn more about the function of those
proteins and how they interact with members of certain
pathways, it is helpful to analyze their amino acid
sequence for specific folds of protein domains or for
motifs for post-translational modifications. The simplest
analysis is a BLAST search against the database of
known protein sequences to find out whether proteins with similar
amino acid sequences have been described in other organisms
[65]. Further, the amino acid sequence can be analyzed
by programs such as Pfam, InterPro, SMART or also
DAVID, to learn whether the identified protein shares a specific
protein fold with other proteins [39,66-68]. These algorithms
apply hidden Markov models (HMMs) to classify
proteins on the basis of their amino acid sequence and to predict
the occurrence of a specific protein domain. Knowing
about the presence of a specific fold could help to
integrate unknown proteins into biological networks.
Secondly, algorithms such as MotifX or PhosphoMotif
Finder analyze the sequence environment of
post-translational modification sites [69,70], thereby reporting
the enrichment of certain amino acid motifs, which can help
to identify the modifying enzyme.
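
The analysis of the sequence environment of modification sites starts by aligning fixed-width windows centred on each modified residue and counting residues at each position. The sketch below shows only this first step under simplifying assumptions: positions are 0-based, windows reaching beyond the protein termini are padded with a placeholder character, and the statistical comparison against background frequencies that the motif tools mentioned here perform is omitted. Sequences and site positions are hypothetical.

```python
from collections import Counter


def site_windows(sequence, positions, flank=3, pad="_"):
    """Return windows of 2*flank+1 residues centred on each modified site.

    positions are 0-based indices into `sequence`.
    """
    padded = pad * flank + sequence + pad * flank
    # After padding, the site at index p sits at p + flank, so the window
    # starts at p and ends at p + 2*flank (inclusive).
    return [padded[p:p + 2 * flank + 1] for p in positions]


def position_frequencies(windows):
    """Count residues at each window position across all aligned windows."""
    if not windows:
        return []
    width = len(windows[0])
    return [Counter(window[i] for window in windows) for i in range(width)]


if __name__ == "__main__":
    # Hypothetical sequences with a modified serine.
    windows = site_windows("AAPSPKK", [3], flank=2) + site_windows("GGASPRR", [3], flank=2)
    for offset, counts in zip(range(-2, 3), position_frequencies(windows)):
        print(offset, dict(counts))
```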

Conclusions 

The development of methods to systematically study all
proteins in a cell and their subsequent functional annotation
opens up new pathways of research. In the future it is
very likely that such studies will uncover new principles of
how biological systems operate, hopefully leading to an
improved treatment of human pathologies.



Over the last ten years the analytical hardware has
reached the level of sophistication of a more mature scientific
field. However, the bioinformatic interpretation and the
processing of the data are still in their infancy. Besides reliable
and robust algorithms, international standards for
data processing and deposition as well as their interpretation
have to be developed and agreed upon in order to
unleash the full potential of proteomic research.